CSC 453

Winter 2026 Day 12

Admin

  • Quiz due 2/20
  • Assignment 3 due 2/23

Virtual memory

Demand paging

  • We can load pages into memory only when they are needed, which is called demand paging
  • Here we see why the valid bit is useful
    • Valid implies “allocated and in memory”
    • Invalid implies “not allocated or not in memory”

Demand paging (cont’d)

  • If process has a page fault, we trap to the OS, which
    1. interrupts the process
    2. finds a free frame (if one exists)
    3. requests the block from the backing store or swap space
    4. brings the page into memory
    5. updates the page table
    6. restarts the processes
  • Non-resident pages may have never been in memory (so in the file system) or were once resident and swapped out (to swap space)

Pure demand paging

  • Most extreme scheme
  • In pure demand paging, all pages are non-resident at the start of execution
  • This is not common in practice, as it can lead to a large number of page faults at the start of execution, which can significantly degrade performance

Working set

  • To mitigate the performance issues of pure demand paging, we can use the concept of a working set
  • Takes advantage of the principle of locality: processes tend to access a relatively small set of pages over a short period of time
  • Working-set model:
    • Define a window \(\Delta\) (most recent page references)
    • If a page is referenced within the window, it’s in the working set; otherwise, it’s not
  • Allows for prepaging pages that are likely to be used soon, improving performance by reducing page faults
    • just outside of the working set, the OS can see “the working set is expanding in this direction, so let’s prefetch some pages in that direction”

Thrashing

  • If the working set of a process exceeds the available physical memory, the process will experience a high rate of page faults, leading to a condition called thrashing
  • When the page fault time exceeds the time spent executing, we are said to be thrashing
  • Problem: CPU scheduler sees the decreasing CPU utilization and increases the degree of multiprogramming; makes thrashing worse
  • Solutions?
    • Local replacement algorithm (each process can only select victims from its set of allocated frames)
    • i.e. cannot steal frames from another process and cause the latter to thrash as well
    • Decrease the degree of multiprogramming (reduce the number of processes)

Optimizing shared pages

  • Remember how fork() and exec() work?
    • Fork leads to a copy of the parent’s address space
    • … but exec() means that a lot of children immediately replace, making a copy wasteful
  • What should we do?
    • Share the pages initially
    • If either the parent or the child writes to the shared page, then you actually make a copy (copy-on-write)
    • Copy-on-write significantly speeds up fork operation; particularly if they will soon exec

What isn’t clear?

Comments? Thoughts?

Real-World Memory

Questions to consider

  • How can we increase TLB reach without making the TLB itself larger?
  • What memory security features does the OS and hardware provide to protect against attacks?
  • How do real architectures (IA-32, x86-64, ARM) handle multiple page sizes and the transition beyond 32-bit addressing?

TLB reach

  • We’ve talked about the hit ratio of a TLB
    • Influenced by the size of the TLB
  • Another metric: TLB reach (i.e., how much memory is addressable from the TLB)
    • Num TLB entries * page size
    • Ideal world: entire working set for a process fits in the TLB
  • It’d be great to simply grow the TLB, but that’s expensive
  • What else can we do?

Huge pages

  • As we discussed, page tables can be large, and TLBs are small. This can lead to a high TLB miss rate, which can significantly degrade performance.
  • One solution is to use larger page sizes, known as huge pages
  • By using huge pages (e.g., 2MB instead of 4KB [can go up to 1GB]), we can reduce the number of pages, and thus the size of the page table, and increase the TLB hit rate
  • Newer kernels support transparent huge pages, which automatically use huge pages when possible without requiring changes to applications
  • Danger: internal fragmentation can be much worse with huge pages, so they are not ideal for all workloads

Linux memory security

  • Linux provides several features to enhance memory security, such as:
    • Address Space Layout Randomization (ASLR): randomizes the memory addresses used by a process, making it more difficult for attackers to predict where code or data is located
    • Kernel Address Space Layout Randomization (KASLR): randomizes the memory addresses used by the kernel
    • Data Execution Prevention (DEP): marks certain areas of memory as non-executable, preventing code from being executed from those regions (e.g., the stack)
      • NX bit: hardware support for DEP held in the page table entry

Intel IA-32 architecture

  • Supports both segmentation and segmentation with paging
    • Each segment can be 4GB
    • Up to 16K segments per process
    • Divided into two partitions
      • First partition of up to 8 K segments are private to process (kept in local descriptor table (LDT))
      • Second partition of up to 8K segments shared among all processes (kept in global descriptor table (GDT))

Intel IA-32 architecture (cont’d)

  • CPU generates logical address
    • Selector given to segmentation unit
      • Which produces linear addresses
    • Linear address given to paging unit
      • Which generates physical address in main memory
      • Paging units form equivalent of MMU

Intel IA-32 architecture (cont’d)

  • 4KB and 4MB pages
    • 4KB pages follow what we have talked about
  • How do we do 4MB, and why?
    • 4MB pages are directly pointed to by the page directory (top level in the page table hierarchy)
    • Benefits?
      • Avoids an extra memory access

IA-32 4KB pages (as we’ve seen)

IA-32 4MB pages (only top-level of page table used)

IA-32 architecture (cont’d)

  • Recall that using a classic 32-bit address space we are limited to 4GB addressable space
    • Fine in the 90s, became decidedly not fine long ago
  • How can we fix this?

32-bit Physical Address Extension (PAE)

  • Paging went to a 3-level scheme
  • Page-directory and page-table entries moved to 64-bits in size (used to be 32-bits)
    • The size of the table remains the same, so we go from 1024 4-byte entries to 512 8-byte entries
    • That gives us an extra two bits (offset still 12-bit, 9-bits for top level 512 entries, 9-bits for second level 512 entries, 2-bits left over), used to refer to a page directory pointer table
  • Net effect is increasing total system address space (originally limited to 36-bits [64GB])
  • But, each process is still limited to 4GB of virtual address space (32-bit addresses)

32-bit PAE

PAE (cont’d)

  • Problem solved?
    • For linux, yes. Problem solved
    • Windows? Of course not, Microsoft decided that you should pay extra for the privilege of using the hardware you own

32-bit PAE question

  • You’re working with a 32-bit IA-32 system that has PAE enabled. The system uses:
    • 2-level paging: PDPT → Page Directory → Page
    • Page size: 2 MB
    • Each page table entry (PTE) is 8 bytes
  • How many entries are in the PDPT, page directory?
  • Break the virtual address into its components (PDPT index, page directory index, page offset)
  • How many total pages can be addressed by this system?
  • What is the maximum addressable physical memory for a single process?
  • What is the maximum addressable physical memory for the entire system?

32-bit PAE question

  • How many entries are in the PDPT, page directory? 4, 512
  • Break the virtual address into its components (PDPT index, page directory index, page offset) 2 bits, 9 bits, 21 bits
  • How many total pages can be addressed by this system? 2048 (4 * 512)
  • What is the maximum addressable physical memory for a single process? 4GB
  • What is the maximum addressable physical memory for the entire system? 64GB

x86-64 architecture

  • 64 bits is enormous (> 16 exabytes)
  • In practice only implement 48 bit addressing (256 TB)
    • Page sizes of 4 KB, 2 MB, 1 GB
    • Four levels of paging hierarchy
  • Can also use PAE-like mechanism (“long mode”) so virtual addresses are 48 bits and physical addresses are 52 bits (4 PB)

ARM32 architecture

  • Dominant mobile platform chip (iOS and Android devices for example)
    • 4 KB and 16 KB pages (changed to 4 KB and 64 KB in ARMv7)
    • 1 MB and 16 MB pages (termed sections)
  • One-level paging for sections, two-level for smaller pages

  • Two levels of TLBs
    • Outer level has two micro TLBs (one data, one instruction)
    • Inner is single main TLB
    • First outer is checked, on miss inner is checked, and on that miss page table walk performed by CPU

What isn’t clear?

Comments? Thoughts?

Kernel memory management

Questions to consider

  • Why does the kernel need physically contiguous memory, and how does this differ from user-space allocation?
  • How does the buddy system handle allocation and deallocation, and what are its trade-offs?
  • Why is a slab allocator useful for small, frequently used kernel objects?

Linux kernel memory management

  • Treated differently from user memory
    • Often allocated from a free-memory pool
    • Kernel requests memory for structures of varying sizes
    • Some kernel memory needs to be contiguous
      • i.e., for device I/O

Linux kernel memory management (cont’d)

  • Main idea: allocate memory from fixed-size segment consisting of physically-contiguous pages
  • Buddy system allocator
    • Memory is allocated in blocks of size \(2^k\) (for some integer \(k\))
    • When a request for memory of size \(s\) is made, the allocator finds the smallest block of size \(2^k\) that can accommodate \(s\)
    • If the block is larger than needed, it is split into two “buddies” of size \(2^{k-1}\) each
    • When a block is freed, the allocator checks if its buddy is also free; if so, they are coalesced back into a larger block

Buddy system allocator (cont’d)

  • Advantages?
    • Simple and efficient for allocating memory of varying sizes
    • Can quickly coalesce free blocks
  • Disadvantages?
    • Fragmentation
      • Internal fragmentation: if a request is for 5KB, we need to allocate an 8KB block, wasting 3KB
      • External fragmentation: over time, as blocks are allocated and freed, we can end up with many small free blocks that cannot be coalesced into larger blocks, making it difficult to satisfy larger allocation requests
    • Not ideal for all workloads, particularly those with many small allocations

Slab allocator

  • Linux also uses a slab allocator for small objects
  • Slab allocator maintains caches of pre-allocated memory chunks for frequently used, fixed-size objects (e.g., file descriptors, process control blocks)
  • When an object is needed, the allocator can quickly provide a chunk from the appropriate cache, reducing fragmentation and improving performance for small allocations

What isn’t clear?

Comments? Thoughts?

Midterm 1 boundary